---
title: "A Lightning Guide to Web Scraping Data With R"
date: "March 21, 2020"
author: "Rama Ramakrishnan"
output: html_notebook
---

### Introduction

In the course so far, the datasets we have analyzed have been well-formatted CSV files. As you know from experience, these are very easy to import into R using the `read_csv` command. 

But quite often in real-world projects, the data won't be available as easily. Sometimes the data you are looking for may be embedded in one or more webpages and you may need to **scrape** them out. For *one-off* situations, cutting and pasting a table of numbers from a webpage into Excel can work just fine. But if you need to do this repeatedly or do it across many different sites, a *programmatic* approach is easier. 

To that end, this brief note describes **how to use R to scrape data from a web page**. 

I will first demonstrate how to scrape data from a Wikipedia page that has up-to-date COVID-19 counts for all US states. I learned about its existence from this [blog post](https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/) (which also shows, somewhat tersely, how to scrape it). Next, I will show a quick example of how to extract prices from an Amazon product page. 

The approach is general and I hope you find it useful for your future projects, perhaps even for the final project for this course. If you don't have programming experience, web scraping may appear difficult but rest assured, it is not. By the end of this note, you should be able to scrape product prices from Amazon and swoop in to buy when the price drops :-)

Let's get started.

### Scraping a Wikipedia Page

First, open [this Wikipedia page](https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102) in Chrome and scroll down until you come to this table:



![](wiki_page.png)

This is the data we want to extract. Each row corresponds to a date, each column corresponds to one of the US states and each cell has the number of *new* COVID-19 cases reported on that date in that state. We want to scrape this table from the page and make it into a dataframe.

To start, let's load a few familiar packages and assign the URL of the webpage to an R variable.

```{r warning=FALSE, message=FALSE}

# load the usual packages
library(dplyr)
library(tidyr)
library(ggplot2)

# the URL of the wikipedia page 
url <- "https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102"
```



We will use a handy R package called `rvest` for the actual scraping. Let's install and load that package. 



```{r message=FALSE}
# install the rvest package. This is a one-time operation.
# install.packages("rvest")

# load rvest package
library(rvest)
```


We next use the `read_html` function to read in the contents of the Wikipedia page into an R variable called `page`. 

```{r}
# read the page by calling the read_html function with the URL of the web page
page <- read_html(url)
```



The entire web page is now contained in the `page` variable. 

A webpage can be a very complicated object and you can spend a lot of time learning about it. For our purposes, however, a simple framework is sufficient: just imagine the webpage as a box which contains many boxes, each of which may contain more boxes, and so on. These boxes are referred to as **tags** (or elements) in the technical literature.  

Any text or numbers you see on a page are 'inside' one of these tags. **In web scraping, the key challenge is to figure out which tag holds the data you are interested in**. Once you identify the right tag, the rest is easy.
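
To make the "boxes inside boxes" picture concrete, here is a minimal sketch on a tiny made-up page (the HTML string and the names `toy_html`, `toy_page` and `outer_box` are purely illustrative and not part of the Wikipedia example). It uses `rvest` functions that are introduced properly further down.

```{r}
library(rvest)

# a made-up miniature page: a <div> box holding a <p> box and a <table> box
toy_html <- '<html><body>
  <div id="outer">
    <p class="note">Hello</p>
    <table class="wikitable"><tr><td>Jan 21</td><td>1</td></tr></table>
  </div>
</body></html>'

# read_html also accepts a string of HTML instead of a URL
toy_page <- read_html(toy_html)

# pick out the 'outer' box ...
outer_box <- html_node(toy_page, "div#outer")

# ... and list the names of the boxes that sit directly inside it
inner_names <- html_name(html_children(outer_box))
inner_names
```

The text 'Jan 21' sits inside a `td` box, which sits inside a `tr` box, which sits inside the `table` box, and so on outwards.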

The easiest way to do this is to open the webpage in Chrome, right-click and choose 'Inspect', as shown below.

![](wiki2.png)


A new panel will open up:


![](wiki3.png)


Click anywhere in that panel and enter Ctrl-F on Windows (Command-F on the Mac).  A little search box will open up.

![](wiki4.png)


Type in any text or data from the content you want to scrape. I typed 'Jan 21' in the search bar and it showed up in the search results:     



![](wiki5.png)



The next step is the critical one.

Tables of data in web pages typically sit inside a 'table' tag so we scan **up** from "Jan 21" and stop at the first table tag we come across. Tags are easy to recognize: they have a '<' before their name.

![](wiki6.png)



Next to the word 'table', we will usually see a 'class' or an 'id'. Here, we see that our table is of class "wikitable".

![](wiki7.png)


Armed with this, we next call the `html_nodes` function as shown below. This will grab *all* the tables on the page that are of class 'wikitable' since, in general, there may be many of them. 

```{r}
page %>% 
  # grab all the tables of class 'wikitable'.
  html_nodes("table.wikitable")
```


Looks like there are 4 such tables on the page. Which one is ours?

As before, we can go back to the web page in Chrome, right-click and choose 'Inspect', click anywhere on the pane that appears, enter Ctrl-F on Windows (Command-F on the Mac), and type 'wikitable' in the search box. The first search result is this table, which is **not** the one we want.

![](wiki8.png)


Press Enter once and it will take you to the second result (below), which *is* the one we want.

![](wiki9.png)


So we add a line to our code to select just the *second* table.



```{r}
page %>% 
  # grab all the tables of class 'wikitable'.
  html_nodes("table.wikitable") %>% 
  # select the 2nd table
  .[[2]]
```
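
Selecting by position works well here, but the position can shift if a table is added to or removed from the page. As a hedged alternative, the sketch below finds the table by its content instead. It runs on a made-up two-table page (`toy_html`, `toy_tables` and `match_pos` are illustrative names) and assumes that the text 'Jan 21' appears only in the table we want.

```{r}
library(rvest)

# made-up page with two tables of class 'wikitable'
toy_html <- '<html><body>
  <table class="wikitable"><tr><th>Source</th></tr><tr><td>CDC</td></tr></table>
  <table class="wikitable"><tr><th>Date</th><th>WA</th></tr><tr><td>Jan 21</td><td>1</td></tr></table>
</body></html>'

toy_tables <- read_html(toy_html) %>%
  html_nodes("table.wikitable")

# html_text returns the text inside each table;
# grepl marks the tables whose text mentions 'Jan 21'
has_date <- grepl("Jan 21", html_text(toy_tables), fixed = TRUE)

# position of the first matching table
match_pos <- which(has_date)[1]
match_pos

# the table itself, ready for the next step
wanted_table <- toy_tables[[match_pos]]
```

On the toy page `match_pos` is 2, the same answer we found by searching in Chrome.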


Now that we have the exact table we want, we  convert it into a nice R dataframe using the `html_table` function and save it in a variable called `covid_counts`.

```{r}
page %>% 
  # grab all the tables of class 'wikitable'.
  html_nodes("table.wikitable") %>% 
  # select the 2nd table
  .[[2]] %>% 
  # convert table into a dataframe and save in the variable 'covid_counts'.
  html_table(fill = TRUE) -> covid_counts
```


Let's take a look at the first few rows of `covid_counts`.


```{r}
head(covid_counts, 3)
```


The column names are US geographic regions, and the names of the states are in the first row of the dataframe. Let's use the state names as the column names and remove that first row.


```{r}
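# use the state codes in the first row as the column names
names(covid_counts) <- as.character(unlist(covid_counts[1, ]))
# drop the first row now that its values have become the column names
covid_counts <- covid_counts[-1, ]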
head(covid_counts, 3)
```
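
If the two steps above feel a bit magical, here they are on a small made-up data frame (`toy` and its values are invented for illustration). The sketch writes the first step with `as.character(unlist(...))`, which makes the conversion of a row into a plain character vector explicit.

```{r}
# made-up data frame: region labels as column names, state codes in row 1
toy <- data.frame(
  Date    = c("Date", "Jan 21", "Jan 24"),
  West    = c("WA", "1", ""),
  Midwest = c("IL", "", "1"),
  stringsAsFactors = FALSE
)

# use the first row as the column names ...
names(toy) <- as.character(unlist(toy[1, ]))
# ... then drop that row
toy <- toy[-1, ]
toy
```

The columns are now called `Date`, `WA` and `IL`, and only the two data rows remain.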


That looks better. Since there are 61 columns and horizontal scrolling is painful, let's use `names` to get a quick sense for all the columns.

```{r}
names(covid_counts)
```


Looks like the last 6 columns are summaries of the state-level data (you can confirm this by going to the Wikipedia page). We don't need them so let's remove them.

```{r}
covid_counts <- covid_counts[,-(56:61)]
```


Next, let's take a look at the last few rows to see if any summaries are lurking there. We can use the `tail` command. Alternatively, we can click on `covid_counts` in the Environment tab in RStudio.

![](wiki12.png)


A spreadsheet-like window will open up and display `covid_counts`. If you scroll to the bottom, you will notice 5 summary rows.

![](wiki13.png)


Let's remove these 5 rows using the `head` function. `head` usually gets the first few rows but it can be used to get all but the last few, as shown below.


```{r}
# by using the 'head' function with a negative number n
# you can get all but the last n
# this is handy since you don't need to calculate the # of rows
# and subtract n etc.

covid_counts <- head(covid_counts, -5)

```
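
Here is the same trick on a small made-up data frame (`toy_rows` and its values are invented), in case you would like to try it before applying it to the real data.

```{r}
# made-up data frame whose last 2 rows are not real data
toy_rows <- data.frame(
  day = c("Mar 1", "Mar 2", "Mar 3", "Total", "Source"),
  n   = c("2", "5", "9", "16", ""),
  stringsAsFactors = FALSE
)

# keep everything except the last 2 rows, whatever the number of rows is
trimmed <- head(toy_rows, -2)
trimmed

# tail shows the 2 rows that were dropped
tail(toy_rows, 2)
```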


Let's take another look at `covid_counts` to see what else needs to be fixed.


```{r}
head(covid_counts)
```


A few things that need fixing:

* The entries in the 'Date' column are represented as text but we want to convert them to R date objects so that R will treat them correctly (e.g., if we plot the data as a time series, R will know that Feb 1 is right after Jan 31).
* The numbers in the cells are represented as characters. We should convert them to numbers so we can add etc.
* There are lots of blanks. We should convert them to 0s.

The following code takes care of these fixes.

```{r}
covid_counts %>%
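  # convert text to legit R date objects; the text has only month and day,
  # so add the year (2020) explicitly, otherwise R assumes the current year
  mutate(Date = as.Date(paste(Date, "2020"), format = "%b %d %Y")) %>%
  # convert all columns except the 'Date' column to numeric (blanks become NA)
  mutate_at(.vars = -1, as.numeric) %>%
  # replace the resulting NAs with zeros
  replace(., is.na(.), 0.0) -> covid_counts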

head(covid_counts)
```
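
Two details of these conversions are worth seeing in isolation. The sketch below uses made-up values (`raw_dates` and `raw_counts` are illustrative names). It assumes the dates belong to 2020 and that your R session understands English month abbreviations such as 'Jan'.

```{r}
# made-up values in the same shape as the scraped cells
raw_dates  <- c("Jan 21", "Feb 1")
raw_counts <- c("3", "", "12")

# text without a year is given the *current* year by as.Date,
# so we attach the year explicitly (2020 is an assumption here)
parsed_dates <- as.Date(paste(raw_dates, "2020"), format = "%b %d %Y")
parsed_dates

# blanks become NA when converted to numbers ...
counts <- as.numeric(raw_counts)
# ... and can then be replaced with zeros
counts[is.na(counts)] <- 0
counts
```

Spelling out the year matters if you re-run the code in a later year and then filter on dates such as "2020-03-01".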


This looks much better and is in a form that's ready for analysis. 

This is just a matter of personal preference, but I feel that it will be easier to make charts of counts by state, etc., if we *pivot* this table so that it becomes long and narrow.

```{r}
covid_counts %>%
  pivot_longer(cols = -Date,
               names_to = "State" ,
               values_to = "Counts") %>%
  arrange(State, Date)
```

This table has only 3 columns but many more rows. The 'counts' are daily *new* cases but it will probably be more informative to look at **cumulative** cases so let's add that as a new column.

```{r}
covid_counts %>%
  pivot_longer(cols = -Date,
               names_to = "State" ,
               values_to = "Counts") %>%
  arrange(State, Date) %>% 
  # group by State so that the cumulative
  # calculation is for each state
  group_by(State) %>%
  # use the handy 'cumsum' function to
  # "cumulatively sum" :-)
  mutate(Cumulative = cumsum(Counts)) %>%
  # undo the group-by and save it back
  ungroup() -> covid_counts

head(covid_counts)
```
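
To see exactly what the pivot and the grouped `cumsum` do, here is the same pipeline on a tiny made-up table with 3 dates and 2 states (`toy_wide`, `toy_long` and the numbers are invented for illustration).

```{r}
library(dplyr)
library(tidyr)

# made-up wide table: one row per date, one column per state
toy_wide <- tibble(
  Date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03")),
  MA   = c(1, 0, 2),
  NH   = c(0, 1, 1)
)

toy_long <- toy_wide %>%
  # one row per (Date, State) pair
  pivot_longer(cols = -Date, names_to = "State", values_to = "Counts") %>%
  arrange(State, Date) %>%
  # the running total restarts for each state
  group_by(State) %>%
  mutate(Cumulative = cumsum(Counts)) %>%
  ungroup()

toy_long
```

The 3 x 2 wide table becomes 6 long rows. The `Cumulative` column reads 1, 1, 3 for MA and then starts over at 0, 1, 2 for NH, which is why the `group_by(State)` step is needed.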


Perfect!

In this "long and narrow" form, the dataset can be easily visualized in many ways. For example, here's a chart of the growth in COVID-19 cases in the New England states from March 1st. 

```{r}
covid_counts %>%
  # keep only the data from March 1 onwards
  filter(Date >= as.Date("2020-03-01")) %>%
  # remove the days before the first case was reported
  filter(Cumulative > 0) %>%
  # filter for New England states
  filter(State %in% c('MA',
                      'CT',
                      'NH',
                      'RI',
                      'VT',
                      'ME')) %>%
  ggplot(aes(x = Date, y = Cumulative, color = State)) +
  geom_point() +
  geom_line() 
```

Since our objective is to show how to use R to scrape data rather than analyze COVID-19 trends, we will stop here. 

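One of the advantages of having all this code is that we can open RStudio and run the script daily with the push of a button and see how the COVID-19 numbers are changing in New England. Note that the URL we used contains `oldid=944107102`, which pins it to one specific revision of the Wikipedia page; to pick up new numbers, point `url` at the current version of the article instead.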

### Scraping Prices from an Amazon Product Page

For our next example, I will show how to scrape a price from an Amazon product page. 

Let's say I want to extract the price from [the Amazon page for a Wi-Fi extender I was looking at last week](https://www.amazon.com/gp/product/B010S6SG3S/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1). One possible use for such a scraper is that I can run my R code every day to see how the price changes.

![](wiki10.png)

As before, let's read in the contents of the page into an R variable.

```{r}
url <- "https://www.amazon.com/gp/product/B010S6SG3S/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1"
  
# read in the URL into an R variable
page <- read_html(url)
```

Again, as before, we open up the page on Chrome, right-click, select 'Inspect', search for $56.96 on the pane that pops up, and note the tag that contains $56.96.

![](wiki11.png)

The name of the tag is 'span' and its id is "priceblock_ourprice". We can use the `html_nodes` function as before but need to change it a bit.

In general, if the tag you are interested in has a `class` attribute, use `html_nodes("name_of_tag.value_of_class")`. If it has an `id` attribute, use `html_nodes("name_of_tag#value_of_id")`. Note the `.` vs the `#`.
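
To make the `.` vs `#` distinction concrete, here is a minimal sketch on a made-up snippet of HTML (the snippet, the second price and the names `toy_page`, `by_class` and `by_id` are invented for illustration; this is not Amazon's actual page).

```{r}
library(rvest)

# made-up snippet: one span identified by an id, two spans sharing a class
toy_html <- '<html><body>
  <span id="priceblock_ourprice" class="price">$56.96</span>
  <span class="price">$61.00</span>
  <span class="label">In stock</span>
</body></html>'

toy_page <- read_html(toy_html)

# '.' selects by class: both spans of class 'price'
# (html_text pulls out the text inside each tag)
by_class <- toy_page %>% html_nodes("span.price") %>% html_text()
by_class

# '#' selects by id: the single span with this id
by_id <- toy_page %>% html_nodes("span#priceblock_ourprice") %>% html_text()
by_id
```

A class can be shared by many tags, while an id is meant to be unique on a page, so an id selector is usually the more precise of the two when one is available.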

In this case, we need to use  `html_nodes("span#priceblock_ourprice")`.

```{r}
# grab any tags that contain the price
page %>% 
  html_nodes("span#priceblock_ourprice")
```

Looks like there's just one tag that contains our price. Excellent. Now we can 'reach inside' the tag and extract its content with the `html_text` function.

```{r}
page %>% 
  # grab any tags that contain the price
  html_nodes("span#priceblock_ourprice") %>% 
  # extract the text content from within the tag
  html_text()
```

Done! We have scraped out the price.
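
The scraped price is still text. If you want to track it from day to day, one possible next step (a sketch, not the only way to do it) is to convert it to a number and append it to a small log file. The file name `price_log.csv` and the use of a temporary folder are assumptions made for this illustration; in practice you would pick a file in your project folder.

```{r}
# the scraped text, as in the example above
price_text <- "$56.96"

# strip everything except digits and the decimal point, then convert
price <- as.numeric(gsub("[^0-9.]", "", price_text))
price

# one row per run; appended to a CSV file it builds up a price history
log_row  <- data.frame(date = Sys.Date(), price = price)
log_file <- file.path(tempdir(), "price_log.csv")  # hypothetical location

# write the header only when the file is first created, append afterwards
is_new <- !file.exists(log_file)
write.table(log_row, log_file, sep = ",", row.names = FALSE,
            col.names = is_new, append = !is_new)

read.csv(log_file)
```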

### Conclusion

There are lots of nuances to web scraping and we have barely scratched the surface, but this note should be enough to get you started if you haven't done it before. Note, however, that webpages keep changing. If you need to scrape data over a long period of time, these changes may 'break' your code, and you will need to repeat the process we went through to find the right tag and update the code.
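
One way to notice such breakage early is to check that the selector still matches something before using the result. The helper below is a hypothetical sketch (`first_text_or_stop` is not part of `rvest`), shown on a made-up one-line page.

```{r}
library(rvest)

# hypothetical helper: return the text of the first tag matching 'css',
# or stop with a readable message if nothing matches
first_text_or_stop <- function(page, css) {
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) {
    stop("No tag matches '", css, "'. Has the page layout changed?")
  }
  html_text(nodes[[1]])
}

toy_page <- read_html('<html><body><span id="price">$56.96</span></body></html>')

# the selector matches, so we get the text back
first_text_or_stop(toy_page, "span#price")

# a selector that no longer matches gives the error message
# instead of silently returning nothing
tryCatch(first_text_or_stop(toy_page, "span#old_price"),
         error = function(e) conditionMessage(e))
```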

For convenience, I have collected all the code in this note as an R script [here](simple_web_scraping.R). Happy scraping!

